# GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

## 🗞️ Abstract
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, **GUI-Rise**, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO).
This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks.


## 🛠️ Setup

```bash
conda create -n gui-rise python=3.10
conda activate gui-rise
bash setup.sh
```

## 📦 Setup Datasets

### Navigation datasets
- Download [GUIAct](https://huggingface.co/datasets/yiye2023/GUIAct) then use our `prepare/hf_guiact.ipynb` to create metadata for each split (i.e., web, mobile).

- Set up Mind2Web, AITW, and MiniWob by following [SeeClick's instructions](https://github.com/njucckevin/SeeClick/blob/main/agent_tasks/readme_agent.md).

The datasets should then be organized as follows:
```
$_DATA_DIR
    - GUIAct
        - images
        - metadata
    - Mind2Web
        - images
        - metadata
    - AITW
        - images
        - metadata
    - MiniWob
        - images
        - metadata
```
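
Before moving on, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and it assumes `_DATA_DIR` is exported as an environment variable (it falls back to the current directory otherwise).

```python
import os
from pathlib import Path

# Dataset names and sub-folders copied from the layout above.
DATASETS = ("GUIAct", "Mind2Web", "AITW", "MiniWob")
SUBDIRS = ("images", "metadata")


def missing_dirs(data_dir):
    """Return the expected sub-directories that do not exist under data_dir."""
    root = Path(data_dir)
    return [
        str(Path(name) / sub)
        for name in DATASETS
        for sub in SUBDIRS
        if not (root / name / sub).is_dir()
    ]


# Report anything missing under $_DATA_DIR (assumed to be exported).
for path in missing_dirs(os.environ.get("_DATA_DIR", ".")):
    print(f"missing: {path}")
```

An empty output means all eight `images` and `metadata` folders were found.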

First, use our `src/open-r1-multimodal/src/open_r1/generate_thinking_data.py` to generate the thinking data, which serve as thinking pseudo labels.
Then use our `src/open-r1-multimodal/src/open_r1/prepare_Mind2Web_data.py`, or the corresponding script for another dataset (prepare_AITW_data, prepare_GUIAct_data, prepare_MiniWob_data), to process the data and produce the metadata.


## 〽️ Start Navigation Training

#### 📚 Cold Start
> [!NOTE] 
> To train the cold-start model, please follow [ShowUI's Instruction](https://github.com/showlab/ShowUI/blob/main/TRAIN.md).
> Before training, run `src/open-r1-multimodal/src/open_r1/generate_thinking_data.py` to generate the thinking pseudo labels, then train with ShowUI's training code.
```bash
deepspeed --num_gpus=8 --master_port 5678 train.py \
  --wandb_key=$WANDB_KEY \
  --model_id='Qwen/Qwen2-VL-2B-Instruct' \
  --version='Qwen/Qwen2-VL-2B-Instruct' \
  --local_weight \
  --local_weight_dir=$_MODEL_DIR \
  --dataset_dir=$_DATA_DIR \
  --log_base_dir=$_SAVE_DIR \
  --epochs=3 \
  --steps_per_epoch=100 \
  --batch_size=1 \
  --grad_accumulation_steps=2 \
  --model_max_length=8192 \
  --exp_id=$_EXP_NAME \
  --train_ratio="1"  \
  --train_dataset="mind2web"  \
  --train_json="hf_train_cold_start_data"   \
  --val_dataset="mind2web"  \
  --val_json="hf_test_full"    \
  --precision="fp32" \
  --attn_imple="sdpa" \
  --workers=0 \
  --lora_r=32 \
  --lora_alpha=64  \
  --min_visual_tokens=1344  \
  --max_visual_tokens=1680  \
  --num_turn=100 \
  --random_sample \
  --record_sample \
  --lr=0.00001 \
  --uniform_prompt  \
  --ds_zero="zero2" \
  --gradient_checkpointing  \
  --lm_skip_ratio=0.5   \
  --lm_skip_layer='[1,28,0]'    \
  --num_history=4    \
  --interleaved_history='tttt'
```
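
When adapting the command to a different number of GPUs, it is useful to know how many samples feed each optimizer step. The sketch below works this out for the values shown above. It assumes that `--batch_size` is a per-GPU value and that training is data-parallel across all GPUs; neither is confirmed here, so check ShowUI's training code before relying on the number. The function name is hypothetical.

```python
def effective_batch_size(num_gpus, per_gpu_batch, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_gpu_batch * grad_accum_steps


# Values from the command above:
# --num_gpus=8, --batch_size=1, --grad_accumulation_steps=2
print(effective_batch_size(8, 1, 2))  # 16
```

Under these assumptions, halving the GPU count while doubling `--grad_accumulation_steps` keeps the effective batch size unchanged.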


#### 📚 Reinforcement Learning
> [!NOTE] 
> If you encounter a 'CUDA out of memory' error, you can try to (1) set `gradient_checkpointing` to `true`, (2) reduce `per_device_train_batch_size`, or (3) use LoRA.

```bash
cd src/open-r1-multimodal/src/open_r1

bash run_grpo_gui_lora.sh
```
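
For orientation, GRPO is generally described as scoring each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that group-relative normalization step. It is not the code in `run_grpo_gui_lora.sh`: the function name and the reward values are hypothetical, and the use of the population standard deviation is one possible choice, since implementations differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four responses sampled at one GUI step.
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Responses that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages.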

#### 📊 Evaluation
To evaluate the model, please follow [ShowUI's Evaluation](https://github.com/showlab/ShowUI/blob/main/TRAIN.md) and run ShowUI's training code with the `--eval_only` flag.
```bash
deepspeed --num_gpus=8 --master_port 5678 train.py \
  --wandb_key=$WANDB_KEY \
  --model_id='Qwen/Qwen2-VL-2B-Instruct' \
  --version='Qwen/Qwen2-VL-2B-Instruct' \
  --local_weight \
  --local_weight_dir=$_MODEL_DIR \
  --dataset_dir=$_DATA_DIR \
  --log_base_dir=$_SAVE_DIR \
  --epochs=3 \
  --steps_per_epoch=100 \
  --batch_size=1 \
  --grad_accumulation_steps=2 \
  --model_max_length=8192 \
  --exp_id=$_EXP_NAME \
  --train_ratio="1"  \
  --train_dataset="mind2web"  \
  --train_json="hf_train"   \
  --val_dataset="mind2web"  \
  --val_json="hf_test_full"    \
  --precision="fp32" \
  --attn_imple="sdpa" \
  --workers=0 \
  --lora_r=0 \
  --lora_alpha=64  \
  --min_visual_tokens=1344  \
  --max_visual_tokens=1680  \
  --num_turn=100 \
  --random_sample \
  --record_sample \
  --lr=0.00001 \
  --uniform_prompt  \
  --ds_zero="zero2" \
  --gradient_checkpointing  \
  --lm_skip_ratio=0.5   \
  --lm_skip_layer='[1,28,0]'    \
  --num_history=4    \
  --interleaved_history='tttt' \
  --eval_only
```

## 🤝 Acknowledgements

We would like to express our sincere gratitude to [Open-R1](https://github.com/huggingface/open-r1), [VLM-R1](https://github.com/om-ai-lab/VLM-R1), [QwenVL](https://github.com/QwenLM/Qwen2.5-VL), and [ShowUI](https://github.com/showlab/ShowUI) for providing open-source resources that contributed to the development of this project.